Libraries used

These are the libraries I have used in order to complete the project

library(ggplot2)
library(corrplot)
library(Stack)
library(fastDummies)
library(dplyr)
library(leaflet)

Import the data

The data is uploaded on a gist, and we store it in two data.frame objects called train and test

train <- read.csv("https://gist.githubusercontent.com/alombatti/f2cb8f1a244784999353309a97a6777f/raw/f38ee90641da97f62964ac10ea558564bfc6cb65/house_price_train.csv")
test <- read.csv("https://gist.githubusercontent.com/alombatti/1df87e336f159842e64ced9380b13e25/raw/49ea0fe6b42fb5d4868e1291dc8e47b20d2cc3bd/house_price_test.csv")

Data understanding

This dataset contains house sale prices for King County, which includes Seattle, Washington. It includes homes sold between May 2014 and May 2015. The dataset provided to us has already been devided in train set and test set, in the sizes of 17277 observations for the train and 4320 observations for the test.

The fields describing the data are:

  1. id: a notation for a house
  2. date: the date the house was sold
  3. price: the prediction target, consisting in the amount in US Dollars the house was sold at
  4. bedrooms: the number of bedrooms in the house
  5. bathrooms: the number of bathrooms in the house
  6. sqft_living: square footage of the home
  7. sqft_lot: square footage of the lot
  8. floors: total floors (levels) in the house
  9. waterfront: use which has a view to the waterfront
  10. view: how many times the house has been viewed
  11. condition: overall condition of the house
  12. grade: overall grade given to the housing unit, based on King County grading system
  13. sqft_above: square footage of house apart from basement
  14. sqft_basement: square footage of the basement
  15. _yr_built: the year in which the house was completed
  16. yr_renovated: year when house was renovated
  17. zipcode: zip
  18. lat: latitude coordinate
  19. long: longitude coordinate
  20. sqft_living15: living room area in 2015 (if different fron sqft_living, it implies some renovations)
  21. sqft_lot15: lot area in 2015 (if different from sqft_lot, it implies some renovations)

Here is a glimpse of what the STRUCTURE of the data looks like (for the train set)

str(train)
## 'data.frame':    17277 obs. of  21 variables:
##  $ id           : num  9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
##  $ date         : Factor w/ 372 levels "1/10/2015","1/12/2015",..: 211 331 290 20 226 223 94 34 76 294 ...
##  $ price        : num  225000 641250 810000 330000 530000 ...
##  $ bedrooms     : int  3 3 4 4 4 4 4 3 4 3 ...
##  $ bathrooms    : num  1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
##  $ sqft_living  : int  1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
##  $ sqft_lot     : int  7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
##  $ floors       : num  1 3 2 1 1 2 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 4 4 3 3 3 3 3 ...
##  $ grade        : int  7 10 9 7 7 9 10 8 9 8 ...
##  $ sqft_above   : int  1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
##  $ sqft_basement: int  0 0 0 0 870 640 0 310 0 1040 ...
##  $ yr_built     : int  1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
##  $ lat          : num  47.4 47.7 47.6 47.8 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
##  $ sqft_lot15   : int  7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...

We then check for NAs in the data we are going to use, both in the train and in the test.

print(length(which(is.na(train) == T)))
## [1] 0
print(length(which(is.na(test) == T)))
## [1] 0

As we can see, the data does not contain any null values, so we can proceed with our analysis.

Transformation of the DATE field

In order to analyze the field DATE, and to be able to determine whether it is relevant in our analysis, we have to transform it so that we can use it. I thought of separating the fields year, month, and day, to, further in the analysis, determine if any of them is directly correlated with the price.

First, I change the field date from factor to date, and I store in a new dataframe the complete field of the date, its year, its month, and its day. Then, I merge the newly created dataset with the one from which I extracted the date field, and I reorder the column in a way that makes sense for the analyais.

I repeat the same procedure for the test and the train datasets.

train$date <- as.Date(train$date, "%m/%d/%Y")
df.date_train <- data.frame(date = train$date,
                      year = as.numeric(format(train$date, format = "%Y")),
                      month = as.numeric(format(train$date, format = "%m")),
                      day = as.numeric(format(train$date, format = "%d")))

train <- unique(merge(train, df.date_train))
rm(df.date_train)
train <- train[, c(2, 1, 22, 23, 24, 3:21)]
test$date <- as.Date(test$date, "%m/%d/%Y")
df.date_test <- data.frame(date = test$date,
                           year = as.numeric(format(test$date, format = "%Y")),
                           month = as.numeric(format(test$date, format = "%m")),
                           day = as.numeric(format(test$date, format = "%d")))

test <- unique(merge(test, df.date_test))
rm(df.date_test)
test <- test[, c(2, 1, 21, 22, 23, 3:20)]

Now I stack together the train and test dataset to procede with a more accurate analysis. I also eliminate the field id and date because they are not going to be relevant.

complete <- Stack(train, test)
complete$id = NULL
complete$date = NULL

The dataset now looks like this:

head(complete)
##   year month day bedrooms bathrooms sqft_living sqft_lot floors waterfront
## 1 2014     5   2        3      2.50        1270     1180    3.0          0
## 2 2014     5   2        3      2.25        1970    35100    2.0          0
## 3 2014     5   2        3      2.50        2090    10834    1.0          0
## 4 2014     5   2        3      2.50        1770     1235    3.0          0
## 5 2014     5   2        3      3.00        1850    19966    1.0          0
## 6 2014     5   2        3      2.00        2710     4500    1.5          0
##   view condition grade sqft_above sqft_basement yr_built yr_renovated
## 1    0         3     8       1270             0     2001            0
## 2    0         4     9       1970             0     1977            0
## 3    0         4     8       1360           730     1987            0
## 4    0         3     8       1600           170     2007            0
## 5    0         4     7       1090           760     1992            0
## 6    0         4     8       1880           830     1929            0
##   zipcode     lat     long sqft_living15 sqft_lot15  price
## 1   98107 47.6697 -122.392          1320       1180 445700
## 2   98027 47.4635 -121.991          2340      35100 437500
## 3   98003 47.3537 -122.303          1750       8595 285000
## 4   98103 47.6965 -122.342          1680       1203 436110
## 5   98038 47.3493 -122.034          1410       6715 287200
## 6   98115 47.6747 -122.295          2060       4500 805000

Correlation matrix

Now I plot a correlation graph to see how much the different variables are correlated to our target variable PRICE

corr = cor(train[, 3:24])

corrplot(corr, method = "color",
         outline = T,
         cl.pos = "n",
         rect.col = "black",
         tl.col = "indianred4",
         addCoef.col = "black",
         number.digits = 2,
         number.cex = 0.60,
         tl.cex = 0.7,
         cl.cex = 1,
         col = colorRampPalette(c("red", "white", "green4")) (100))

From the matrix, we can see that we have many variables correlated somehow with price, some more and some less. I decided to proceed my analysis keeoing only the variables that have a correlation >= 0.15 with the target variable.

Therefore, going forward I will only keep:

  1. price
  2. bedrooms
  3. bathrooms
  4. sqft_living
  5. floors
  6. waterfront
  7. view
  8. grade
  9. sqft_above
  10. sqft_basement
  11. lat
  12. sqft_living15

Nonetheless, some considerations have to be made. The variables condition, yr_renovated, and zipcode are going to be converted as factors further on in the analysis. Therefore, we still include them in the listof variables we are going to keep because we can not really interepret their value now since it is numeric, and as numeric it might not affect the correlation with price.

complete2 <- select(complete, "price", "bedrooms", "bathrooms", "sqft_living", "floors",
                    "waterfront", "view", "grade", "sqft_above", "sqft_basement", "lat",
                    "sqft_living15", "condition", "yr_renovated", "zipcode")

Data Visualization

This is an interactive map with all the houses based on their geographical position, coloured by their price.

complete$PriceBin<-cut(complete$price, c(0,250e3,500e3,750e3,1e6,2e6,999e6))

center_lon = median(complete$long,na.rm = TRUE)
center_lat = median(complete$lat,na.rm = TRUE)

factpal <- colorFactor(c("black","blue","yellow","orange","#0B5345","red"), 
                      complete$PriceBin)



leaflet(complete) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircles(lng = ~long, lat = ~lat, 
             color = ~factpal(PriceBin))  %>%
  setView(lng=center_lon, lat=center_lat,zoom = 12) %>%
  
  addLegend("bottomright", pal = factpal, values = ~PriceBin,
            title = "House Price Distribution",
            opacity = 1)

In this section we are going to analyze the relationship of all the varaibles with our target, the variable price. Before doing this, I create a copy of the dataset, and I split it in train and test, based on the value of the variable price.

complete2 <- complete

train2 <- split(complete2, complete2$price > 0)
train2 <- train2[["TRUE"]]

test2 <- split(complete2, is.na(complete2$price))
test2 <- test2[["TRUE"]]

Distribution of prices

Here we look at the histrogram showing how many houses have been sold for each price.

graph1 <- ggplot(data = train2, aes(x = price)) +    
  geom_histogram(alpha = 0.8, fill = "#F1C40F") +
  labs(x = "price", y = "houses", title = "Distribution of Prices") +
  theme_bw()

graph1

Since the interval of price of the houses is very wide, it is smart to focus only on a small segment, to see the patten where most of the houses are. In this histogram, we only see the distribution of houses sold at a price up to 1.5 million US Dollars.

graph2 <- ggplot(data = train2, aes(x = price)) +    
  geom_histogram(alpha = 0.8, fill = "#F1C40F") +
  scale_x_continuous(limits=c(0, 1.5e6)) +
  labs(x = "price", y = "houses", title = "Distribution of Prices (up to 1.5 million)") +
  theme_bw()

graph2


For the sole purpose of data visualization, I now add a comuln to my train dataframe rapresenting the logarithmic transformation of the price, so that the graphs will look prettier and more understandable.

train2$logprice = log(train2$price)

Boxplot between price and bedrooms

graph3 <- boxplot(train2[, "logprice"] ~ train2[, "bedrooms"],
                  main = "Price vs Bedrooms", col=c("#F1C40F","#336633"),
                  xlab="bedrooms", ylab="log (price)")

From this graph, we notice there are two odd values, when the bedrooms are 11 and when the bedrooms are 33.

print(subset(train2, train2$bedrooms > 10))
##      year month day bedrooms bathrooms sqft_living sqft_lot floors
## 2866 2014     6  25       33      1.75        1620     6000      1
## 6003 2014     8  21       11      3.00        3000     4960      2
##      waterfront view condition grade sqft_above sqft_basement yr_built
## 2866          0    0         5     7       1040           580     1947
## 6003          0    0         3     7       2400           600     1918
##      yr_renovated zipcode     lat     long sqft_living15 sqft_lot15  price
## 2866            0   98103 47.6878 -122.331          1330       4700 640000
## 6003         1999   98106 47.5560 -122.363          1420       4960 520000
##             PriceBin logprice
## 2866 (5e+05,7.5e+05] 13.36922
## 6003 (5e+05,7.5e+05] 13.16158

In fact, we can see all the houses with a number of bedrooms bigger than 10 is just two, therefore we exclude these records from our analysis not to disturb our model.

Boxplot between price and bathrooms

graph4 <- boxplot(train2[, "logprice"] ~ train2[, "bathrooms"],
                  main = "Price vs Bathrooms", col=c("#F1C40F","#336633"),
                  xlab="bathrooms", ylab="log (price)")

The relationship looks linear overall. Nonetheless, the price of the house with 7.5 bathrooms is relatively lower than its neighbours. Therefore, we decide to eliminate that value.

Boxplot between price and floors

graph5 <- boxplot(train2[, "logprice"] ~ train2[, "floors"],
                  main = "Price vs Floors", col=c("#F1C40F","#336633"),
                  xlab="floors", ylab="log (price)")

The relationship is linear.

Boxplot between price and waterfront

graph6 <- boxplot(train2[, "logprice"] ~ train2[, "waterfront"],
                  main = "Price vs Waterfront", col=c("#F1C40F","#336633"),
                  xlab="waterfront", ylab="log (price)")

The relationsip is linear. Here, the 0 rapresents the houses without a waterfront, while the value 1 rapresents the houses with a waterfront. It is common sense that, if the house has a waterfront, its price is higher.

Boxplot between price and view

graph7 <- boxplot(train2[, "logprice"] ~ train2[, "view"],
                  main = "Price vs View", col=c("#F1C40F","#336633"),
                  xlab="view", ylab="log (price)")

The relationsip is linear. Here, the 0 rapresents the worst view, while the value 4 rapresents the best view. It is common sense that, the more beautiful the view, the higher the price of the house.

Boxplot between price and condition

graph8 <- boxplot(train2[, "logprice"] ~ train2[, "condition"],
                  main = "Price vs Condition", col=c("#F1C40F","#336633"),
                  xlab="condition", ylab="log (price)")

The relationship is linear.

Boxplot between price and grade

graph9 <- boxplot(train2[, "logprice"] ~ train2[, "grade"],
                  main = "Price vs Grade", col=c("#F1C40F","#336633"),
                  xlab="grade", ylab="log (price)")

The relationship is linear.

Boxplot between price and sqft_living

graph10 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living"],
                  main = "Price vs sqft_Living", col=c("#F1C40F","#336633"),
                  xlab="sqft_living", ylab="log (price)")

The relationship looks linear. The bigger the living room, the higher the price of the house.

Boxplot between price and sqft_basement

graph11 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_basement"],
                   main = "Price vs sqft_Basement", col=c("#F1C40F","#336633"),
                   xlab="sqft_basement", ylab="log (price)")

This graph is interesting. Here, we can see that a lot of houses do not have a basement at all, while tbe houses that have a basement, even if the size of the basement changes a lot, their prices do not really change.

length(train2$sqft_basement[train2$sqft_basement == 0])
## [1] 10521
length(train2$sqft_basement[train2$sqft_basement != 0])
## [1] 6756

Therefore, I decided to simplify this field and make it a boolean: it will take value 0 if the house does not have a basement, and it will take value 1 if the house has a basement, no matter its size.

Boxplot between price and yr_renovated

graph12 <- boxplot(train2[, "logprice"] ~ train2[, "yr_renovated"],
                   main = "Price vs yr_Renovated", col=c("#F1C40F","#336633"),
                   xlab="yr_renovated", ylab="log (price)")

This graph is very similar to the previous one regarding the basement situation. As for having a basement or not, the same reasoning can be applied to the renovation of the house. It does not really matter when the house has been renovated, but what matters is if the house has been renovated at all.

length(train2$yr_renovated[train2$yr_renovated == 0])
## [1] 16538
length(train2$yr_renovated[train2$yr_renovated != 0])
## [1] 739

As we can see, most of the houses have not been renovated, so we will transform this field into boolean.

Boxplot between price and zipcode

graph13 <- boxplot(train2[, "logprice"] ~ train2[, "zipcode"],
                   main = "Price vs Zipcode", col=c("#F1C40F","#336633"),
                   xlab="zipcode", ylab="log (price)")

We can not really see a pattern in this graph, so we will categorize the variable when modelling (as predicted before).

Boxplot between price and sqft_living15

graph14 <- boxplot(train2[, "logprice"] ~ train2[, "sqft_living15"],
                   main = "Price vs sqft_Living(15)", col=c("#F1C40F","#336633"),
                   xlab="sqft_living(15)", ylab="log (price)")

As for the variable sqft_living, we see a direct relationship in this graph. As the area of the living room increases, the price ofthe house increases as well.




Given all the insights from the data visualization section, we can proceed in apply the changes to our dataset.

First, as seen in the graph for the bathrooms, we eliminate the bathroom with value 7.5 because its price was odd and probably it was a mistake.

train2 <- train2[-5419, ]

Then, as seen in the graph for the bedrooms, we only keep those house with 10 or less bedrooms, since the two houses with 11 and 33 bedrooms had wrong values for their price.

train2 <- train2[train2$bedrooms <= 10, ]

We dummify the sqft_basement field, transforming all the houses with a basement with value 1, and keeping all the houses without a basement with value 0.

train2$sqft_basement[train2$sqft_basement != 0] = 1
test2$sqft_basement[test2$sqft_basement != 0] = 1

We dummify the yr_renovated field, transforming all the houses that has been renovated with value 1, and keeping all the houses without renovation with value 0.

train2$yr_renovated[train2$yr_renovated != 0] = 1
test2$yr_renovated[test2$yr_renovated != 0] = 1



Lastly, we stack back together the train and test set.

train2$logprice <- NULL
complete2 <- Stack(train2, test2)

Data Transformation

I create a copy of the complete dataset to proceed with the analysis.

complete3 <- complete2

Now, it is time to standardize the variables involving the squarefeet. I create a function that normalizes the value of the field between 0 and 1, 0 being the lowest value and 1 being the highest.

range01 <- function(x) {
  (x-min(x)) / (max(x)-min(x))
}

Therefore, I can proceed with the normalization of the variables sqft_living, sqft_above, and sqft_living15.

complete3$sqft_living <- range01(complete3$sqft_living)
complete3$sqft_above <- scale(complete3$sqft_above)
complete3$sqft_living15 <- scale(complete3$sqft_living15)

In order to proceed with the design of the model, I transform as factor the following variables because they do not make sense as numeric in this context:

  1. bathrooms
  2. floors
  3. waterfront
  4. view
  5. grade
  6. sqft_basement
  7. yr_renovated
  8. condition
  9. zipcode
complete3$bedrooms <- as.factor(complete2$bedrooms)
complete3$bathrooms <- as.factor(complete2$bathrooms)
complete3$floors <- as.factor(complete2$floors)
complete3$waterfront <- as.factor(complete2$waterfront)
complete3$view <- as.factor(complete2$view)
complete3$grade <- as.factor(complete2$grade)
complete3$sqft_basement <- as.factor(complete3$sqft_basement)
complete3$yr_renovated <- as.factor(complete3$yr_renovated)
complete3$condition <- as.factor(complete3$condition)
complete3$zipcode <- as.factor(complete3$zipcode)

Another variable that does not make sense to take as numeric is lat. Nonetheless, we can not consider it as factor neither because of the too many values it takes.

Therefore, we bin the factor in 13 intervals.

breaks_lat <- c(-Inf, 47.20, 47.25, 47.30, 47.35, 47.40, 47.45, 47.50, 47.55, 47.60,
                47.65, 47.70, 47.75, Inf)

names_lat <- c("47.15 - 47.20", "47.20 - 47.25", "47.25 - 47.30", "47.30 - 47.35",
               "47.35 - 47.40", "47.40 - 47.45", "47.45 - 47.50", "47.50 - 47.55",
               "47.55 - 47.60", "47.60 - 47.65", "47.65 - 47.70", "47.70 - 47.75",
               "47.75 - 47.80")

complete3$lat <- cut(complete3$lat, breaks = breaks_lat, labels = names_lat)

In the end, I unstack the train and test dataset from the complete dataset to build our model.

train3 <- split(complete3, complete3$price > 0)
train3 <- train3[["TRUE"]]

test3 <- split(complete3, is.na(complete3$price))
test3 <- test3[["TRUE"]]

Model

I decided to use a linear regression model on the train dataset with all the variables available.

model1 <- lm(formula = price ~ bedrooms +
               bathrooms +
               sqft_living +
               floors +
               waterfront +
               view +
               grade +
               sqft_above +
               sqft_basement +
               sqft_living15 +
               lat +
               condition +
               yr_renovated +
               zipcode,
               data = train3)

The model scores an R2 of 0.84, which is a relatively good result.

summary(model1)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + grade + sqft_above + sqft_basement + 
##     sqft_living15 + lat + condition + yr_renovated + zipcode, 
##     data = train3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2190513   -59233     1561    54439  3580444 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       105895.9   175782.5   0.602 0.546899    
## bedrooms2          18705.3    12352.9   1.514 0.129980    
## bedrooms3          27431.6    12391.9   2.214 0.026865 *  
## bedrooms4          11235.3    12678.5   0.886 0.375539    
## bedrooms5           3750.6    13377.6   0.280 0.779201    
## bedrooms6         -41203.4    16297.0  -2.528 0.011471 *  
## bedrooms7        -175298.3    30625.6  -5.724 1.06e-08 ***
## bedrooms8         -28839.2    48513.3  -0.594 0.552213    
## bedrooms9        -216616.1    85972.9  -2.520 0.011758 *  
## bedrooms10       -307331.1    85934.1  -3.576 0.000349 ***
## bathrooms0.75       1636.6    85908.6   0.019 0.984801    
## bathrooms1         37925.3    83612.5   0.454 0.650134    
## bathrooms1.25     -22304.3    98047.6  -0.227 0.820050    
## bathrooms1.5       29613.1    83754.4   0.354 0.723665    
## bathrooms1.75      28746.2    83715.0   0.343 0.731316    
## bathrooms2         28233.3    83743.4   0.337 0.736015    
## bathrooms2.25      46906.5    83780.1   0.560 0.575571    
## bathrooms2.5       35350.1    83764.5   0.422 0.673017    
## bathrooms2.75      33247.7    83867.8   0.396 0.691793    
## bathrooms3         48116.1    83945.0   0.573 0.566526    
## bathrooms3.25     101517.8    84062.0   1.208 0.227197    
## bathrooms3.5       64362.4    84026.3   0.766 0.443699    
## bathrooms3.75     115602.0    84931.2   1.361 0.173492    
## bathrooms4        120992.4    85047.9   1.423 0.154859    
## bathrooms4.25     218310.3    86011.1   2.538 0.011152 *  
## bathrooms4.5      175216.7    85570.7   2.048 0.040612 *  
## bathrooms4.75     457794.3    90557.1   5.055 4.34e-07 ***
## bathrooms5        144406.9    92664.9   1.558 0.119162    
## bathrooms5.25     358576.1    95232.3   3.765 0.000167 ***
## bathrooms5.5      808789.4   104722.2   7.723 1.20e-14 ***
## bathrooms5.75     -69956.3   119726.7  -0.584 0.559026    
## bathrooms6        449460.3   112709.5   3.988 6.70e-05 ***
## bathrooms6.25     614990.0   135350.9   4.544 5.57e-06 ***
## bathrooms6.5      135972.1   133143.5   1.021 0.307153    
## bathrooms6.75    1322542.4   170028.0   7.778 7.76e-15 ***
## bathrooms7.75    2743916.9   177705.2  15.441  < 2e-16 ***
## bathrooms8       1687262.9   138434.1  12.188  < 2e-16 ***
## sqft_living      1485330.3    68162.2  21.791  < 2e-16 ***
## floors1.5          -6877.5     4423.3  -1.555 0.120000    
## floors2           -29278.9     3834.0  -7.637 2.35e-14 ***
## floors2.5          83146.2    13871.8   5.994 2.09e-09 ***
## floors3           -90689.4     7967.4 -11.383  < 2e-16 ***
## floors3.5         124630.2    73596.4   1.693 0.090392 .  
## waterfront1       604170.1    16368.4  36.911  < 2e-16 ***
## view1              87258.3     9293.6   9.389  < 2e-16 ***
## view2              68934.3     5657.9  12.184  < 2e-16 ***
## view3             152822.4     7670.1  19.924  < 2e-16 ***
## view4             271543.5    11897.7  22.823  < 2e-16 ***
## grade4           -164813.7   149361.9  -1.103 0.269846    
## grade5           -202272.5   146917.7  -1.377 0.168600    
## grade6           -211491.1   146773.2  -1.441 0.149621    
## grade7           -201562.8   146813.1  -1.373 0.169795    
## grade8           -172201.0   146855.0  -1.173 0.240976    
## grade9            -93833.1   146923.5  -0.639 0.523057    
## grade10            23433.4   147033.7   0.159 0.873376    
## grade11           198133.1   147286.2   1.345 0.178570    
## grade12           468331.2   148324.1   3.157 0.001594 ** 
## grade13          1156590.5   155802.3   7.423 1.19e-13 ***
## sqft_above         35642.1     4755.1   7.496 6.92e-14 ***
## sqft_basement1     -6611.7     4419.0  -1.496 0.134620    
## sqft_living15      13102.0     2032.5   6.446 1.18e-10 ***
## lat47.20 - 47.25    7976.0    22797.1   0.350 0.726443    
## lat47.25 - 47.30   56876.0    36726.3   1.549 0.121486    
## lat47.30 - 47.35   48007.2    37605.5   1.277 0.201761    
## lat47.35 - 47.40   53472.8    38929.6   1.374 0.169591    
## lat47.40 - 47.45   97113.1    40066.6   2.424 0.015370 *  
## lat47.45 - 47.50   93291.1    40941.2   2.279 0.022699 *  
## lat47.50 - 47.55  121299.2    41577.1   2.917 0.003534 ** 
## lat47.55 - 47.60  169494.9    42026.6   4.033 5.53e-05 ***
## lat47.60 - 47.65  247503.0    43405.0   5.702 1.20e-08 ***
## lat47.65 - 47.70  230325.0    44365.3   5.192 2.11e-07 ***
## lat47.70 - 47.75  216039.8    45131.8   4.787 1.71e-06 ***
## lat47.75 - 47.80  190043.3    45684.1   4.160 3.20e-05 ***
## condition2         76732.9    33522.7   2.289 0.022093 *  
## condition3         85444.5    31266.7   2.733 0.006287 ** 
## condition4        110998.5    31289.9   3.547 0.000390 ***
## condition5        162689.7    31472.7   5.169 2.38e-07 ***
## yr_renovated1      64027.8     5623.8  11.385  < 2e-16 ***
## zipcode98002         367.1    14042.4   0.026 0.979145    
## zipcode98003       -3880.8    12776.1  -0.304 0.761316    
## zipcode98004      580905.4    23743.0  24.466  < 2e-16 ***
## zipcode98005      153589.0    24631.9   6.235 4.61e-10 ***
## zipcode98006      143669.1    21462.9   6.694 2.24e-11 ***
## zipcode98007       84230.0    25535.1   3.299 0.000974 ***
## zipcode98008       83311.3    24383.3   3.417 0.000635 ***
## zipcode98010       55214.7    18202.0   3.033 0.002421 ** 
## zipcode98011      -12904.4    28974.6  -0.445 0.656057    
## zipcode98014      -75444.1    28096.9  -2.685 0.007257 ** 
## zipcode98019      -61668.2    28893.3  -2.134 0.032829 *  
## zipcode98022       33876.6    33716.7   1.005 0.315035    
## zipcode98023      -33148.6    11185.0  -2.964 0.003044 ** 
## zipcode98024       57092.2    26492.8   2.155 0.031175 *  
## zipcode98027       95121.2    20681.0   4.599 4.27e-06 ***
## zipcode98028      -17049.8    28244.6  -0.604 0.546085    
## zipcode98029      108310.9    22223.4   4.874 1.10e-06 ***
## zipcode98030        5676.3    16214.6   0.350 0.726287    
## zipcode98031      -14362.7    17173.3  -0.836 0.402976    
## zipcode98032        -829.5    19549.4  -0.042 0.966157    
## zipcode98033      179603.1    25982.5   6.912 4.93e-12 ***
## zipcode98034       36558.3    27121.7   1.348 0.177696    
## zipcode98038       31100.9    14011.6   2.220 0.026455 *  
## zipcode98039      929172.6    34264.4  27.118  < 2e-16 ***
## zipcode98040      380841.4    22304.4  17.075  < 2e-16 ***
## zipcode98042       -2848.7    13660.6  -0.209 0.834815    
## zipcode98045       54286.1    20428.1   2.657 0.007881 ** 
## zipcode98052       61182.8    25376.6   2.411 0.015920 *  
## zipcode98053       26017.6    25754.1   1.010 0.312398    
## zipcode98055       -2589.0    19429.9  -0.133 0.893997    
## zipcode98056       27351.7    20465.5   1.336 0.181410    
## zipcode98058      -11873.8    18124.1  -0.655 0.512388    
## zipcode98059       31144.1    19913.2   1.564 0.117836    
## zipcode98065       22135.3    21723.0   1.019 0.308226    
## zipcode98070       -3592.3    20653.5  -0.174 0.861922    
## zipcode98072       10085.5    28235.0   0.357 0.720947    
## zipcode98074      -19412.8    24380.1  -0.796 0.425895    
## zipcode98075       38633.3    22365.0   1.727 0.084115 .  
## zipcode98077      -26381.3    28836.7  -0.915 0.360282    
## zipcode98092      -21374.9    11968.1  -1.786 0.074119 .  
## zipcode98102      302381.2    28465.6  10.623  < 2e-16 ***
## zipcode98103      169860.9    25820.2   6.579 4.89e-11 ***
## zipcode98105      307349.9    27131.9  11.328  < 2e-16 ***
## zipcode98106       51400.0    21620.4   2.377 0.017447 *  
## zipcode98107      183218.4    26996.1   6.787 1.18e-11 ***
## zipcode98108       28248.8    23301.5   1.212 0.225408    
## zipcode98109      345896.0    28032.5  12.339  < 2e-16 ***
## zipcode98112      426548.9    25228.2  16.908  < 2e-16 ***
## zipcode98115      165556.3    25877.0   6.398 1.62e-10 ***
## zipcode98116      179893.9    22539.4   7.981 1.54e-15 ***
## zipcode98117      153474.1    25925.4   5.920 3.28e-09 ***
## zipcode98118       73837.3    21042.9   3.509 0.000451 ***
## zipcode98119      298179.9    26375.5  11.305  < 2e-16 ***
## zipcode98122      146699.0    25214.8   5.818 6.06e-09 ***
## zipcode98125       38954.9    27442.8   1.419 0.155773    
## zipcode98126      110094.4    21517.7   5.116 3.15e-07 ***
## zipcode98133        7186.8    27387.5   0.262 0.793008    
## zipcode98136      179475.3    22225.1   8.075 7.17e-16 ***
## zipcode98144      163387.1    22439.2   7.281 3.45e-13 ***
## zipcode98146       29519.7    21014.6   1.405 0.160121    
## zipcode98148       18124.9    25968.9   0.698 0.485218    
## zipcode98155       -9094.5    27686.4  -0.328 0.742552    
## zipcode98166       28217.7    19921.4   1.416 0.156662    
## zipcode98168        3206.1    20737.4   0.155 0.877134    
## zipcode98177       79032.8    28409.6   2.782 0.005410 ** 
## zipcode98178      -23483.6    20998.9  -1.118 0.263445    
## zipcode98188      -14773.7    21527.7  -0.686 0.492554    
## zipcode98198      -20536.8    16450.3  -1.248 0.211894    
## zipcode98199      210995.7    25188.8   8.377  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144300 on 17127 degrees of freedom
## Multiple R-squared:  0.8464, Adjusted R-squared:  0.845 
## F-statistic: 646.2 on 146 and 17127 DF,  p-value: < 2.2e-16

I notice that the variable sqft_basement is not significant, so I re-model without that variable.

model2 <- lm(formula = price ~ bedrooms +
               bathrooms +
               sqft_living +
               floors +
               waterfront +
               view +
               grade +
               sqft_above +
               sqft_living15 +
               lat +
               condition +
               yr_renovated +
               zipcode,
               data = train3)
summary(model2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + grade + sqft_above + sqft_living15 + 
##     lat + condition + yr_renovated + zipcode, data = train3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2190582   -59141     1648    54520  3583476 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       113830.4   175708.9   0.648 0.517100    
## bedrooms2          18314.4    12350.5   1.483 0.138124    
## bedrooms3          26826.7    12385.8   2.166 0.030330 *  
## bedrooms4          10704.3    12674.0   0.845 0.398351    
## bedrooms5           3461.4    13376.7   0.259 0.795819    
## bedrooms6         -41118.7    16297.5  -2.523 0.011645 *  
## bedrooms7        -174762.9    30624.6  -5.707 1.17e-08 ***
## bedrooms8         -27843.7    48510.5  -0.574 0.565993    
## bedrooms9        -215650.8    85973.6  -2.508 0.012139 *  
## bedrooms10       -306878.0    85936.7  -3.571 0.000357 ***
## bathrooms0.75       2854.0    85907.9   0.033 0.973498    
## bathrooms1         39490.9    83609.0   0.472 0.636699    
## bathrooms1.25     -21275.3    98048.7  -0.217 0.828221    
## bathrooms1.5       30505.9    83755.3   0.364 0.715693    
## bathrooms1.75      29472.3    83716.6   0.352 0.724806    
## bathrooms2         29266.7    83743.6   0.349 0.726734    
## bathrooms2.25      47229.1    83782.8   0.564 0.572960    
## bathrooms2.5       36145.9    83765.9   0.432 0.666102    
## bathrooms2.75      33780.6    83870.1   0.403 0.687120    
## bathrooms3         48649.3    83947.3   0.580 0.562244    
## bathrooms3.25     101589.6    84065.0   1.208 0.226885    
## bathrooms3.5       64331.5    84029.3   0.766 0.443935    
## bathrooms3.75     115765.8    84934.2   1.363 0.172898    
## bathrooms4        121190.7    85050.8   1.425 0.154198    
## bathrooms4.25     218580.3    86014.1   2.541 0.011056 *  
## bathrooms4.5      175160.7    85573.8   2.047 0.040684 *  
## bathrooms4.75     458082.7    90560.2   5.058 4.27e-07 ***
## bathrooms5        145126.9    92667.0   1.566 0.117341    
## bathrooms5.25     358749.2    95235.7   3.767 0.000166 ***
## bathrooms5.5      809693.9   104724.2   7.732 1.12e-14 ***
## bathrooms5.75     -72486.5   119719.1  -0.605 0.544874    
## bathrooms6        452155.2   112699.2   4.012 6.04e-05 ***
## bathrooms6.25     615522.3   135355.4   4.547 5.47e-06 ***
## bathrooms6.5      134140.4   133142.7   1.007 0.313712    
## bathrooms6.75    1341142.1   169579.1   7.909 2.76e-15 ***
## bathrooms7.75    2737817.5   177664.8  15.410  < 2e-16 ***
## bathrooms8       1696080.5   138313.6  12.263  < 2e-16 ***
## sqft_living      1412948.9    48020.2  29.424  < 2e-16 ***
## floors1.5          -6775.6     4422.9  -1.532 0.125557    
## floors2           -29217.1     3833.9  -7.621 2.65e-14 ***
## floors2.5          82601.3    13867.5   5.956 2.63e-09 ***
## floors3           -89419.3     7922.3 -11.287  < 2e-16 ***
## floors3.5         126311.6    73590.5   1.716 0.086105 .  
## waterfront1       603760.2    16366.7  36.890  < 2e-16 ***
## view1              87146.1     9293.6   9.377  < 2e-16 ***
## view2              68955.2     5658.1  12.187  < 2e-16 ***
## view3             153303.9     7663.6  20.004  < 2e-16 ***
## view4             272209.2    11889.8  22.894  < 2e-16 ***
## grade4           -164197.8   149366.7  -1.099 0.271656    
## grade5           -202224.7   146923.0  -1.376 0.168716    
## grade6           -211707.9   146778.5  -1.442 0.149218    
## grade7           -202115.4   146818.0  -1.377 0.168642    
## grade8           -172827.4   146859.7  -1.177 0.239284    
## grade9            -94414.4   146928.3  -0.643 0.520500    
## grade10            22656.2   147038.1   0.154 0.877545    
## grade11           197241.4   147290.3   1.339 0.180545    
## grade12           467112.1   148327.2   3.149 0.001640 ** 
## grade13          1155883.0   155807.2   7.419 1.24e-13 ***
## sqft_above         40930.0     3181.3  12.866  < 2e-16 ***
## sqft_living15      13026.7     2031.9   6.411 1.48e-10 ***
## lat47.20 - 47.25    7622.9    22796.7   0.334 0.738092    
## lat47.25 - 47.30   56801.8    36727.6   1.547 0.121986    
## lat47.30 - 47.35   47898.8    37606.8   1.274 0.202797    
## lat47.35 - 47.40   53471.7    38931.0   1.374 0.169615    
## lat47.40 - 47.45   97021.6    40068.0   2.421 0.015470 *  
## lat47.45 - 47.50   93070.2    40942.4   2.273 0.023027 *  
## lat47.50 - 47.55  120873.1    41577.6   2.907 0.003652 ** 
## lat47.55 - 47.60  169214.8    42027.7   4.026 5.69e-05 ***
## lat47.60 - 47.65  247363.5    43406.5   5.699 1.23e-08 ***
## lat47.65 - 47.70  230068.5    44366.6   5.186 2.18e-07 ***
## lat47.70 - 47.75  215981.8    45133.4   4.785 1.72e-06 ***
## lat47.75 - 47.80  189948.4    45685.8   4.158 3.23e-05 ***
## condition2         75954.1    33519.9   2.266 0.023468 *  
## condition3         84927.3    31265.9   2.716 0.006608 ** 
## condition4        110606.6    31289.9   3.535 0.000409 ***
## condition5        162552.9    31473.7   5.165 2.44e-07 ***
## yr_renovated1      64266.3     5621.8  11.432  < 2e-16 ***
## zipcode98002         387.4    14042.9   0.028 0.977990    
## zipcode98003       -4009.5    12776.3  -0.314 0.753658    
## zipcode98004      581026.8    23743.7  24.471  < 2e-16 ***
## zipcode98005      153577.5    24632.8   6.235 4.63e-10 ***
## zipcode98006      143975.4    21462.7   6.708 2.03e-11 ***
## zipcode98007       84513.3    25535.3   3.310 0.000936 ***
## zipcode98008       83360.1    24384.1   3.419 0.000631 ***
## zipcode98010       54937.0    18201.7   3.018 0.002546 ** 
## zipcode98011      -13224.5    28974.8  -0.456 0.648100    
## zipcode98014      -75293.9    28097.7  -2.680 0.007376 ** 
## zipcode98019      -61627.4    28894.3  -2.133 0.032951 *  
## zipcode98022       34072.6    33717.7   1.011 0.312258    
## zipcode98023      -33498.2    11182.9  -2.995 0.002744 ** 
## zipcode98024       56962.3    26493.6   2.150 0.031566 *  
## zipcode98027       94804.6    20680.7   4.584 4.59e-06 ***
## zipcode98028      -17205.6    28245.4  -0.609 0.542435    
## zipcode98029      108626.4    22223.2   4.888 1.03e-06 ***
## zipcode98030        5551.5    16215.0   0.342 0.732078    
## zipcode98031      -14411.7    17173.9  -0.839 0.401391    
## zipcode98032       -1255.8    19548.1  -0.064 0.948780    
## zipcode98033      179587.3    25983.4   6.912 4.96e-12 ***
## zipcode98034       36171.6    27121.4   1.334 0.182322    
## zipcode98038       31301.1    14011.4   2.234 0.025498 *  
## zipcode98039      928698.7    34264.2  27.104  < 2e-16 ***
## zipcode98040      381226.3    22303.7  17.093  < 2e-16 ***
## zipcode98042       -2831.4    13661.1  -0.207 0.835810    
## zipcode98045       54456.0    20428.5   2.666 0.007690 ** 
## zipcode98052       61047.3    25377.4   2.406 0.016157 *  
## zipcode98053       26245.3    25754.5   1.019 0.308191    
## zipcode98055       -2750.3    19430.3  -0.142 0.887440    
## zipcode98056       27545.6    20465.8   1.346 0.178342    
## zipcode98058      -11956.6    18124.7  -0.660 0.509465    
## zipcode98059       31291.6    19913.6   1.571 0.116117    
## zipcode98065       22391.8    21723.1   1.031 0.302656    
## zipcode98070       -3765.3    20653.9  -0.182 0.855345    
## zipcode98072        9780.9    28235.2   0.346 0.729041    
## zipcode98074      -19471.8    24381.0  -0.799 0.424507    
## zipcode98075       38717.0    22365.8   1.731 0.083455 .  
## zipcode98077      -26557.8    28837.5  -0.921 0.357091    
## zipcode98092      -21326.8    11968.5  -1.782 0.074781 .  
## zipcode98102      301006.6    28451.8  10.580  < 2e-16 ***
## zipcode98103      169097.9    25816.1   6.550 5.91e-11 ***
## zipcode98105      306119.8    27120.5  11.287  < 2e-16 ***
## zipcode98106       50726.4    21616.5   2.347 0.018954 *  
## zipcode98107      182195.5    26988.5   6.751 1.52e-11 ***
## zipcode98108       27535.9    23297.5   1.182 0.237252    
## zipcode98109      344538.9    28018.8  12.297  < 2e-16 ***
## zipcode98112      425357.6    25216.5  16.868  < 2e-16 ***
## zipcode98115      164599.3    25870.1   6.363 2.03e-10 ***
## zipcode98116      178734.3    22526.9   7.934 2.25e-15 ***
## zipcode98117      152520.8    25918.5   5.885 4.06e-09 ***
## zipcode98118       73299.3    21040.6   3.484 0.000496 ***
## zipcode98119      296744.3    26359.0  11.258  < 2e-16 ***
## zipcode98122      145489.5    25202.7   5.773 7.93e-09 ***
## zipcode98125       38354.3    27440.9   1.398 0.162219    
## zipcode98126      109243.6    21510.9   5.079 3.84e-07 ***
## zipcode98133        6557.5    27385.3   0.239 0.810756    
## zipcode98136      178382.9    22213.9   8.030 1.04e-15 ***
## zipcode98144      162303.9    22428.3   7.237 4.80e-13 ***
## zipcode98146       29240.0    21014.5   1.391 0.164116    
## zipcode98148       17989.5    25969.6   0.693 0.488499    
## zipcode98155       -9493.0    27686.1  -0.343 0.731693    
## zipcode98166       27681.2    19918.9   1.390 0.164640    
## zipcode98168        2693.5    20735.3   0.130 0.896649    
## zipcode98177       78350.1    28406.9   2.758 0.005819 ** 
## zipcode98178      -23646.2    20999.4  -1.126 0.260163    
## zipcode98188      -14923.4    21528.2  -0.693 0.488192    
## zipcode98198      -20883.2    16449.2  -1.270 0.204261    
## zipcode98199      209708.8    25175.0   8.330  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 144300 on 17128 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.845 
## F-statistic: 650.6 on 145 and 17128 DF,  p-value: < 2.2e-16

Unfortunately, after taking out sqtf_basement, the model did not improve.

Then, we fit the model to our dataset to predict the price of every house in the train dataset, to see if they reflect the actual value.

train3_fit <- fitted(model2)
train3$pred <- train3_fit

We have to score our model on the MAPE, which is the mean absolute percentage error. The lowest this number, the better the model.

mape <- function(real, predicted) {
  return(mean(abs((real - predicted) / real)))
}
mape(train3$price, train3$pred)
## [1] 0.1668104

We get a MAPE score of 0.1668104, which is very good for our model.

Last, with our model we predictthe prices in the test dataset.

test3$price <- predict(model2, test3)